Please copy and paste the below code chunk in it’s entirety to your console to download R package libraries needed for this practical. If you are having trouble installing any of the R packages, please ask an instructor for a pre-loaded flash drive.
deps <- c("tidyverse", "genescaper", "MIPanalyzer","rmaverick")
deps <- !sapply(deps, function(x){x %in% installed.packages()[,1]})
if (any(deps)) {
if (deps["tidyverse"]) {
install.packages("tidyverse")
}
if (deps["genescaper"]) {
devtools::install_github("mrc-ide/genescaper", ref = "v0.1.0")
}
if (deps["MIPanalyzer"]) {
devtools::install_github("mrc-ide/MIPanalyzer", ref = "v1.0.0")
}
if (deps["rmaverick"]) {
devtools::install_github("bobverity/rmaverick", ref = "v1.1.0")
}
}Now load all of those libraries into this session using the code chunk below. Please copy and paste it in its entirety.
library(tidyverse)
library(genescaper)
library(MIPanalyzer)
library(rmaverick)Finally, source the additional functions that are needed for this practical by copy-pasting this function:
source("source_functions/pop_structure_utils.R")One of the most fundamental observations that we can make from genetic data is how the frequencies of different alleles vary over time and over space. For loci where mutations confer drug resistance, allele frequencies tell us something with direct clinical implications - if an allele is at high frequencies then we might consider switching first-line drugs. However, even for loci that have no direct impact on phenotype, and therefore are selectively neutral, allele frequencies can give us important information about what is going on in the parasite population. These loci will be influenced by factors such as prevalence, human and mosquito population size, and migration rates between different sub-populations. If we can correctly interpret these signals then we can use neutral allele frequencies to tell us things about population structure and connectivity that can be relevant for control purposes.
We will work with both simulated and real data in this practical. The simulated data will allow us to explore how allele frequencies change over time as a function of different underlying parameters - something that we can almost never do from messy real-world data. This will allow us to build up an intuition about what we would expect to see in real data. In the second half of the practical, we will use real-world datasetes including a large molecular inversion probe (MIP) dataset taken from DRC and surrounding countries, previously analysed by Verity et al. 2020. We will use this to explore population structure in various ways.
By the end of this practical, you should be able to:
Explanation of the term “deme”?
Try running this function lots of times. What do you notice about the allele frequency over time?
sim1 <- sim_freqs(N = 1000, mut_rate = 0)## simulation completed in 0.00210409 seconds
Now repeat with a smaller population size (which represents the number of infected hosts in our case). What do you notice about the strength of drift? What happens once an allele reaches a frequency of 0 or 1? Why is this?
sim_freqs(N = 100, mut_rate = 0)## simulation completed in 0.00110296 seconds
Now repeat the simulation, this time running for 10 years and with some mutation. Do frequencies still always end up fixed/lost?
sim_freqs(N = 1000, t_out = seq(0, 365 * 10), mut_rate = 1e-4)## simulation completed in 0.0127457 seconds
We will now simulate a large number of loci, but output at just a single timepoint.
# simulate allele frequencies and look at the data
sim1 <- sim_freqs(N = 1000, t_out = 365 * 5, mut_rate = 1e-6, loci = 1e3, plot_on = FALSE)## simulation completed in 0.165925 seconds
head(sim1$data)## time deme locus freq
## 1 1825 1 1 1.000
## 2 1825 1 2 1.000
## 3 1825 1 3 0.130
## 4 1825 1 4 0.306
## 5 1825 1 5 1.000
## 6 1825 1 6 1.000
We can plot this allele frequency spectrum
# simulate allele frequencies and look at the data
ggplot(sim1$data) +
geom_histogram(aes(x = freq), breaks = seq(0, 1, 0.01))What happens if you increase the mutation rate? What happens if you increase the population size?
Run the code below multiple times to repeat the simulation of drift, but now also plotting homozygosity. What do you notice about how homozygosity changes over time? (It always increases to 1).
sim_freqs(N = 1000, t_out = seq(0, 365 * 10, 7), plot_homo = TRUE)## simulation completed in 0.00173731 seconds
What happens when you add in some mutation? Try mutation rates of 1e-3 and 1e-4. What happs to the equilibrium level of homozygosity?
sim_freqs(N = 1000, t_out = seq(0, 365 * 10, 7), mut_rate = 1e-4, plot_homo = TRUE)## simulation completed in 0.00180051 seconds
We can add in a line showing the theoretical expectation.
sim_freqs(N = 1000, t_out = seq(0, 365 * 5, 7), mut_rate = 1e-4, plot_homo = TRUE, plot_expected_homo = TRUE)## simulation completed in 0.000612286 seconds
Try running 10 independent subpopulations. Notice that allele frequencies tend to diverge
sim1 <- sim_freqs(N = 1000, demes = 10, mig_rate = 0, mut_rate = 1e-4)## simulation completed in 0.00453362 seconds
We can calculate a simple version of Fst manually
sim_fst <- sim1$data %>%
select(-locus) %>%
group_by(time) %>%
summarise(p_mean = mean(freq),
p_var = var(freq),
Fst = p_var / (p_mean * (1 - p_mean)))Now lets plot Fst over time. What do you notice about differentiation?
ggplot(sim_fst) +
geom_line(aes(x = time, y = Fst)) +
expand_limits(y = c(0, 1))What happens if you repeat the process above with a higher mutation rate?
sim1 <- sim_freqs(N = 100, demes = 10, mig_rate = 0, mut_rate = 1e-3,
t_out = seq(0, 365*10, 7), plot_on = FALSE)## simulation completed in 0.0049587 seconds
sim1$data %>%
select(-locus) %>%
group_by(time) %>%
summarise(p_mean = mean(freq),
p_var = popvar(freq),
Fst = p_var / (p_mean * (1 - p_mean))) %>%
ggplot() + theme_bw() +
geom_line(aes(x = time, y = Fst)) +
expand_limits(y = c(0, 1))The equilibrium level of Fst is known to be 1/(1 + 2Nmu) in this case. How does this level compare with what you saw in the plot?
# e.g. use
#geom_hline(yintercept = 1 / (1 + 2*1e2*1e-3), linetype = "dashed")The fact that Fst depends on the mutation rate can lead to problems. If one marker has a high mutation rate (e.g. microsat) then we would expect to see low values of Fst, even if the populations have been evolving separately for a long time, and in this sense are well differentiated.
Jost’s D is an alternative way of looking at differentiation.
sim1 <- sim_freqs(N = 100, demes = 10, mig_rate = 0, mut_rate = 1e-3,
t_out = seq(0, 365*10, 7), plot_on = FALSE)## simulation completed in 0.00471879 seconds
sim1$data %>%
select(-locus) %>%
group_by(time) %>%
summarise(Js = mean(freq^2 + (1 - freq)^2),
Jt = mean(freq)^2 + mean(1 - freq)^2,
D = (Js - Jt) / Js * max(deme) / (max(deme) - 1)) %>%
ggplot() + theme_bw() +
geom_line(aes(x = time, y = D)) +
expand_limits(y = c(0, 1.2))Jost’s D does not suffer from the same problem as Fst, but it suffers from a different problem - it assumes infinite alleles mutation. When we have a biallelic SNP, Jost’s D will tend to reach values of around 0.5 at equilibrium. (This is actually a bit of a subtle point - maybe easier to understand if I generalise the above to allow more than two alleles?)
Conclusion - both Fst and D have their limitations. Fst is good when mutation rates are low, for example biallelic SNPs. Jost’s D is good for highly diverse and fast-mutating sites, such as microsats.
What happens when we connect our demes up by migration? Try experimenting with larger and smaller migration rates in the following. What do you notice about the allele frequencies? (lines tend to be less bumpy and move together when migration is high)
sim_freqs(N = 1000, demes = 5, mig_rate = 0.01)## simulation completed in 0.00437263 seconds
Lets look at what happens to Fst when we include migration. What happens when you increase/decrease the migration rate?
sim1 <- sim_freqs(N = 1000, demes = 10, t_out = seq(0, 365*20, 7), mig_rate = 1e-3,
mut_rate = 1e-6, plot_on = FALSE)## simulation completed in 0.0276048 seconds
sim1$data %>%
select(-locus) %>%
group_by(time) %>%
summarise(p_mean = mean(freq),
p_var = popvar(freq),
Fst = p_var / (p_mean * (1 - p_mean))) %>%
ggplot() + theme_bw() +
geom_line(aes(x = time, y = Fst)) +
expand_limits(y = c(0, 1)) +
geom_hline(yintercept = 1 / (1 + 2*1e3*(1e-3 + 1e-6)), linetype = "dashed")Theory tells us that the equilibrium level of Fst in this situation is given by 1/(1 + 2N(m + mu)). Does this result hold up this your simulations? (yes, on average). Note that migration rates are usually much larger than mutation rates, meaning we can often simplify this to 1/(1 + 2Nm).
Conclusion - at equilibrium, Fst reaches a level that is a balance between the strength of genetic drift (i.e. small population size) and migration. The higher the migration rate, the lower the value of Fst. This means we can use Fst to tell us something about the connectivity of populations. Small pairwise Fst tends to indicate highly connected populations and vice versa.
Note, however, that these patterns take a very long time to build up. So we are not talking about the timescale of control programmes, but more of an evolutionary timescale.
Load the DRC MIP data. This data is a slightly simplified version of that used in the NatComms paper.
Have a look at the data using the MIPanalyzer package.
# load data
MIP_data <- readRDS("data/DRC_MIPs_biallelic_processed.rds")
# have a look at the data
MIP_data
head(MIP_data$samples)## ID Country Admin1_name Year Latitude Longitude Cluster
## 07GHR5002-AG-1 07GHR5002 Ghana <NA> 2013 6.385908 -0.376496 1
## 07GHR5006-AG-1 07GHR5006 Ghana <NA> 2013 6.385908 -0.376496 1
## 07GHR5018-AG-1 07GHR5018 Ghana <NA> 2013 6.385908 -0.376496 1
## 07GHR5025-AG-1 07GHR5025 Ghana <NA> 2013 6.385908 -0.376496 1
## 07GHR5028-AG-1 07GHR5028 Ghana <NA> 2013 6.385908 -0.376496 1
## 07GHR5029-AG-1 07GHR5029 Ghana <NA> 2013 6.385908 -0.376496 1
head(MIP_data$loci)## CHROM POS REF ALT NEUTRAL GEO
## 9 1 100608 A G Neutral Non-geographic
## 19 1 138823 C T Neutral Non-geographic
## 31 1 139191 C T Non-neutral Geographic
## 35 1 140820 A C Non-neutral Geographic
## 36 1 155939 G A Neutral Non-geographic
## 63 1 182927 T C Neutral Non-geographic
let’s create a copy of the dataset that does not include any DRC samples:
# filter samples
MIP_data_noDRC <- MIPanalyzer::filter_samples(MIP_data, MIP_data$samples$Country != "DRC")
MIP_data_noDRCNow lets do PCA on the within-sample allele frequencies
wsaf_impute <- MIPanalyzer::get_wsaf(MIP_data_noDRC, impute = TRUE, FUN = mean)
pca <- MIPanalyzer::pca_wsaf(wsaf_impute)
plot_df <- data.frame(PC1 = pca$x[,1],
PC2 = pca$x[,2],
Country = MIP_data_noDRC$samples$Country)
ggplot(plot_df) + theme_bw() +
geom_point(aes(x = PC1, y = PC2, color = Country))We see well-separated clusters, which makes sense geographicall. Now let’s repeat the process using all samples including those from DRC.
wsaf_impute <- MIPanalyzer::get_wsaf(MIP_data, impute = TRUE, FUN = mean)
pca <- MIPanalyzer::pca_wsaf(wsaf_impute)
plot_df <- data.frame(PC1 = pca$x[,1],
PC2 = pca$x[,2],
Country = MIP_data$samples$Country)
ggplot(plot_df) + theme_bw() +
geom_point(aes(x = PC1, y = PC2, color = Country))The DRC points connect the other points. This suggests that there is more of a spatial continuum of allele frequencies.
We can plot the contributions that each locus makes to this PCA. Starting with PC1, what do you notice about the balance of geographically informative vs. non-informative SNPs? (the pattern is mostly driven by geographically informative SNPs). What does this suggest? (the primary factor driving this PCA is spatial location)
# plot component 1
MIPanalyzer::plot_pca_contribution(pca, component = 1, chrom = MIP_data$loci$CHROM, pos = MIP_data$loci$POS,
locus_type = MIP_data$loci$GEO)Now repeat the process with PC2. What do you notice about where these loci are on the genome? (they are all in one or two places). What genes are at these positions? (pfcrt on chrom7. Small peak at dhps on chromosome 8)
# plot component 2
MIPanalyzer::plot_pca_contribution(pca, component = 2, chrom = MIP_data$loci$CHROM, pos = MIP_data$loci$POS,
locus_type = MIP_data$loci$GEO)Load barcode data from Amy Bei’s 2018 paper. This data has already been filtered to remove the two Thies samples, and to order in terms of date (2001-2002 vs. 2014) and location (Dielmo and Ndiop)
Give a snapshot of the data, which includes measures of transmission intensity in the two timepoints and regions
load("data/Bei_2018_processed.RData")
head(Bei_2018_processed$EIR)## Time range Location EIR
## 1 2001 Dielmo 353.8
## 2 2002 Dielmo 409.9
## 3 2001 Ndiop 171.9
## 4 2002 Ndiop 16.9
## 5 2014 Dielmo 26.3
## 6 2014 Ndiop <0.05
head(Bei_2018_processed$barcodes)## Sample Code Date Sex Age Parasite Density Location M/P genomic A1 B1
## 1 IP09 2001-01-02 F 3.4 9150 Dielmo M C A
## 2 IP07 2001-01-03 F 2.6 20800 Dielmo P T A
## 3 IP03 2001-02-05 M 10.5 6450 Dielmo P N A
## 4 IP05 2001-02-05 F 3.8 22200 Dielmo P C A
## 5 IP01 2001-02-06 M 2.3 10550 Dielmo P N N
## 6 IP08 2002-01-07 F 3.7 29450 Dielmo M T A
## A2 B2 A3 B3 A4 B4 A5 B5 A6 B6 A7 B7 A8 B8 A9 B9 A10 B10 A11 B11 A12 B12
## 1 C C G C A G T T T A T C A A C T C G A T T G
## 2 C T C C G G T N T A C C C C C A C A C N T G
## 3 N T C N N G N T T A C N C N N N A N A T T G
## 4 T C G C A G T C T N T A C C C N A A A N T G
## 5 C N G C N G N C T A N N C C N N A N A N T G
## 6 C C G C G A A T T A T A C C C T A G A C T G
head(Bei_2018_processed$SNP_locations)## SNP code SNP location
## 1 A1 Pf_01_000130573
## 2 B1 Pf_01_000539044
## 3 A2 Pf_02_000842803
## 4 B2 Pf_04_000282592
## 5 A3 Pf_05_000931601
## 6 B3 Pf_06_000145472
Some data processing
# filter out polygenomics
mav_data <- Bei_2018_processed$barcodes %>%
filter(`M/P genomic` == "M")
# convert genotype calls to numeric
for (i in 8:31) {
mav_data[,i] <- match(mav_data[,i], c("A", "C", "T", "G"))
}
head(mav_data)## Sample Code Date Sex Age Parasite Density Location M/P genomic A1 B1
## 1 IP09 2001-01-02 F 3.4 9150 Dielmo M 2 1
## 2 IP08 2002-01-07 F 3.7 29450 Dielmo M 3 1
## 3 IP06 2002-01-14 M 1.1 10000 Dielmo M 3 1
## 4 IP04 2002-02-03 F 11.6 19750 Dielmo M 3 1
## 5 IP17 2001-01-02 M 5.6 13600 Ndiop M 3 1
## 6 IP11 2001-05-02 M 0.7 2150 Ndiop M 3 1
## A2 B2 A3 B3 A4 B4 A5 B5 A6 B6 A7 B7 A8 B8 A9 B9 A10 B10 A11 B11 A12 B12
## 1 2 2 4 2 1 4 3 3 3 1 3 2 1 1 2 3 2 4 1 3 3 4
## 2 2 2 4 2 4 1 1 3 3 1 3 1 2 2 2 3 1 4 1 2 3 4
## 3 3 2 4 2 4 4 3 3 3 1 3 2 1 2 2 1 1 4 1 2 3 4
## 4 3 3 4 2 4 4 3 2 3 1 2 2 1 2 3 1 1 4 1 2 3 4
## 5 3 3 2 2 1 4 3 3 3 1 2 2 1 1 2 3 1 4 1 3 3 4
## 6 2 3 2 4 4 4 3 2 NA 4 3 1 2 1 3 1 2 4 1 2 3 4
Load the data into rmaverick, and make a first model set. Print the project to check everything looks as planned (25 samples, 24 loci etc.)
myproj <- rmaverick::mavproject() %>%
rmaverick::bind_data(df = mav_data, ID_col = 1, data_cols = 8:31, ploidy = 1) %>%
rmaverick::new_set(name = "no admixture model", admix_on = FALSE)
myproj## DATA:
## individuals = 25
## loci = 24
## ploidy = 1
## pops = 1
## missing data = 2 of 600 gene copies (0%)
##
## PARAMETER SETS:
## * SET1: no admixture model
##
## ACTIVE SET: SET1
## model = no-admixture
## lambda = 1
Run MCMC. Users should run this with pb_markdown = FALSE.
myproj <- rmaverick::run_mcmc(myproj, K = 1:8, burnin = 1e3, samples = 1e3,
rungs = 10, GTI_pow = 1.5, pb_markdown = TRUE)## Calculating exact solution for K = 1
## completed in 0.00106051 seconds
##
## Running MCMC for K = 2
## Burn-in phase
##
|
|======================================================================| 100%
## converged within 300 iterations
## Sampling phase
##
|
|======================================================================| 100%
## completed in 0.371569 seconds
##
## Running MCMC for K = 3
## Burn-in phase
##
|
|======================================================================| 100%
## converged within 200 iterations
## Sampling phase
##
|
|======================================================================| 100%
## completed in 0.436496 seconds
##
## Running MCMC for K = 4
## Burn-in phase
##
|
|======================================================================| 100%
## converged within 200 iterations
## Sampling phase
##
|
|======================================================================| 100%
## completed in 0.487195 seconds
##
## Running MCMC for K = 5
## Burn-in phase
##
|
|======================================================================| 100%
## converged within 400 iterations
## Sampling phase
##
|
|======================================================================| 100%
## completed in 0.68694 seconds
##
## Running MCMC for K = 6
## Burn-in phase
##
|
|======================================================================| 100%
## converged within 200 iterations
## Sampling phase
##
|
|======================================================================| 100%
## completed in 0.71706 seconds
##
## Running MCMC for K = 7
## Burn-in phase
##
|
|======================================================================| 100%
## converged within 200 iterations
## Sampling phase
##
|
|======================================================================| 100%
## completed in 0.81376 seconds
##
## Running MCMC for K = 8
## Burn-in phase
##
|
|======================================================================| 100%
## converged within 200 iterations
## Sampling phase
##
|
|======================================================================| 100%
## completed in 1.00172 seconds
##
## Processing results
Some MCMC diagnostic checks. First, check that coupling rates are above 0. Do this for every value of K from 2 to the max value explored.
rmaverick:::plot_mc_acceptance(myproj, K = 8)Now check what the program things about the best value of K. In this case, values around 3-5 are reasonably likely
rmaverick::plot_logevidence_K(myproj)rmaverick::plot_posterior_K(myproj)Produce Structure plot. Add vertical lines to break up into the following groups: 1. Dielmo 2001-2001 2. Ndiop 2001-2002 1. Dielmo 2014 2. Ndiop 2014
rmaverick::plot_qmatrix(myproj, K = 3:5) +
geom_vline(xintercept = c(4, 10, 16) + 0.5, size = 1)Notive that the later samples are (mostly) clearly allocated to one or other population with high probability. These essentially represent clonal lineages. We can see that groups 2,3,4 are present in Dielmo in 2014, and in Ndiop we additionally have group 1. Comparing this to the earlier time point, samples are far more ambigious in their allocation. They look a bit like some of these clonal lineages, but it’s not clear cut, probably because there is some additional variation. Interestingly, group 4 dominates in Ndiop in 2001-2002. This lineage is almost completely lost by the later sampling point. Overall, we can see evidence of increasing population structure over time. Compare this with the EIR values - what do you notice? (lower transmission later in time, therefore increasing differentiation and population structure).